Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
SparkSession.builder
.master("local")
.appName("Section 2.1 - Looking at Your Data")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
| | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
Selecting a Subset of Columns
When you're working with raw data, you are usually only interested in a subset of the columns. This means you should get into the habit of selecting only the columns you need before performing any Spark transformations.
Why?
If you do not, and you are working with a wide dataset, your Spark application will do more work than it needs to. This is because all of the extra columns will be shuffled between the workers during the execution of the transformations.
This is especially costly if you have string columns that contain large amounts of text.
Note
Spark is sometimes smart enough to work out which columns aren't being used and perform a projection pushdown to drop the unneeded columns. But it's better practice to do the selection explicitly first.
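One way to check whether this pushdown actually happened is to look at the physical plan. Below is a minimal sketch, assuming the pets DataFrame loaded above; if the projection was pushed down, the ReadSchema of the file scan should list only the selected columns.
# A sketch: print the physical plans and compare the ReadSchema of the csv scan.
# Assumes the `pets` DataFrame read from pets.csv earlier in this section.
pets.explain()                                    # scan reads every column
pets.select("id", "nickname", "color").explain()  # ReadSchema should be narrowed to the selected columns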
Option 1 - select()
(
pets
.select("id", "nickname", "color")
.toPandas()
)
| | id | nickname | color |
|---|---|---|---|
| 0 | 1 | King | brown |
| 1 | 2 | Argus | None |
| 2 | 3 | Chewie | None |
What Happened?
Similar to a SQL select statement, it keeps only the columns specified in the arguments in the resulting df. A list object can be passed as the argument as well.
If you have a wide dataset and only want to work with a small number of columns, a select will require fewer lines of code.
Note
If the argument * is provided, all the columns will be selected.
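For completeness, here is a small sketch of both variants mentioned above, using the same pets DataFrame:
# A sketch: passing a list of column names, and selecting everything with "*".
cols = ["id", "nickname", "color"]
pets.select(cols).toPandas()   # a list of column names works as well
pets.select("*").toPandas()    # "*" keeps every column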
Option 2 - drop()
(
pets
.drop("breed_id", "birthday", "age")
.toPandas()
)
| | id | nickname | color |
|---|---|---|---|
| 0 | 1 | King | brown |
| 1 | 2 | Argus | None |
| 2 | 3 | Chewie | None |
What Happened?
This is the opposite of a select statement; it will drop all of the columns specified.
If you have a wide dataset and need the majority of the columns, a drop will require fewer lines of code.
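If the columns to remove are already held in a list, they can be unpacked into drop(). A minimal sketch using the same pets DataFrame; cols_to_drop is just an illustrative name:
# A sketch: dropping columns held in a list by unpacking it into drop().
cols_to_drop = ["breed_id", "birthday", "age"]
pets.drop(*cols_to_drop).toPandas()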
Summary
- Work with only the subset of columns required for your Spark application; there is no need to do extra work.
- Depending on the number of columns you are going to work with, a select may be preferable to a drop, or vice-versa.